gguf-py : support lazy tensor splitting #12809
Merged
Conversation
Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

As explained in #12791 (comment), this will likely help reduce RAM usage when converting Llama 4, since the approach in #12791 uses `torch.split` on the FFN projections.
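To make the tuple-handling concrete, here is a minimal, self-contained sketch of the idea, not the actual gguf-py `LazyBase`/`LazyTorchTensor` implementation; `LazyTensor`, `materialize`, and `lazy_split` are hypothetical names, and numpy's `np.split` stands in for `torch.split`:

```python
# Hypothetical sketch: wrapping a tuple-returning op so each element
# becomes its own lazy handle instead of forcing eager evaluation.
import numpy as np

class LazyTensor:
    def __init__(self, thunk):
        self._thunk = thunk  # zero-arg callable producing the real array

    def materialize(self):
        return self._thunk()

def lazy_split(lazy: "LazyTensor", parts: int, axis: int = 0) -> tuple:
    # Each element of the split gets its own LazyTensor, so unpacking
    # the tuple does not evaluate anything yet. A real implementation
    # would cache the split result instead of recomputing it per element.
    return tuple(
        LazyTensor(lambda i=i: np.split(lazy.materialize(), parts, axis=axis)[i])
        for i in range(parts)
    )

base = LazyTensor(lambda: np.arange(12).reshape(6, 2))
gate, up = lazy_split(base, 2)  # no evaluation happens here
print(gate.materialize())       # the split runs only on demand
```

The key point is that the result of the split must itself be a tuple of lazy objects; returning, or indexing into, an eagerly computed tuple would force the whole source tensor into memory at split time.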
Contributor
My sha256sum is 56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2. I may be able to pull your changes and see if it's different, but from looking at previously uploaded conversions it doesn't look like any folder metadata gets in there, and I don't add any of my own, so it should match up.
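For reference, the byte-level hash quoted above can be reproduced with a short Python helper (the file name below is a placeholder):

```python
# Compute the SHA-256 of a converted GGUF file in streaming fashion,
# to verify that two conversion runs produce byte-identical output.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("model-f16.gguf"))  # placeholder path
```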
Contributor
sha256 of the conversion with this change: 56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2, so it matches, woo! The conversion wasn't WAY faster; it still took well over an hour, I think about 1:30, but that's still faster than before, which was over 1:45 🤷
ngxson approved these changes on Apr 8, 2025
danielhanchen added a commit to unslothai/llama.cpp that referenced this pull request on Apr 8, 2025
tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request on Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
colout pushed a commit to colout/llama.cpp that referenced this pull request on Apr 21, 2025
* gguf-py : support lazy tensor splitting

  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint
timwu pushed a commit to timwu/llama.cpp that referenced this pull request on May 5, 2025
* gguf-py : support lazy tensor splitting

  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint
timwu pushed a commit to timwu/llama.cpp that referenced this pull request on Dec 20, 2025
* gguf-py : support lazy tensor splitting

  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint